Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). How will you measure the effectiveness of a good algorithm? Why does your chosen validation method make sense for this specific dataset and the stakeholders' needs?
This project is based on a rainfall prediction dataset from Australia. The original data file, 'weatherAus.csv', was output as 'rainfall.csv' after our pre-processing. The data set contains daily weather observations from numerous Australian weather stations. The daily observations are available from http://www.bom.gov.au/climate/data. Copyright Commonwealth of Australia 2010, Bureau of Meteorology.
The following are the data types and meanings for the features used in Lab Two (19 variables, by class type):
Codebook in its entirety can be found at https://www.kaggle.com/jsphyg/weather-dataset-rattle-package.
The original data set had 140,787 observations and 24 features. After pre-processing, dimension reduction left the data set at 101,328 observations and 19 features. Observations were removed if they contained NaNs in any feature. The following features were removed:
We also decided, for classification/regression purposes, to convert the cardinal directions to degrees: for example, 'N': 0, 'NNE': 22.5, 'NE': 45, and so on.
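A minimal sketch of that conversion, assuming pandas and a direction column named as in the codebook ('WindGustDir' here, with a few example values); the full 16-point compass mapping steps by 22.5 degrees:

```python
import pandas as pd

# 16 compass points mapped to degrees in 22.5-degree steps.
direction_to_degrees = {
    'N': 0.0, 'NNE': 22.5, 'NE': 45.0, 'ENE': 67.5,
    'E': 90.0, 'ESE': 112.5, 'SE': 135.0, 'SSE': 157.5,
    'S': 180.0, 'SSW': 202.5, 'SW': 225.0, 'WSW': 247.5,
    'W': 270.0, 'WNW': 292.5, 'NW': 315.0, 'NNW': 337.5,
}

df = pd.DataFrame({'WindGustDir': ['N', 'NE', 'SSW', 'W']})
df['WindGustDir'] = df['WindGustDir'].map(direction_to_degrees)
print(df['WindGustDir'].tolist())  # [0.0, 45.0, 202.5, 270.0]
```

The same `map` call applies to 'WindDir9am' and 'WindDir3pm'; any value not in the dictionary becomes NaN, which our NaN-dropping step would then remove.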
Reducing the number of features helps keep our proposed model simpler and mitigates the probability of overfitting. Since the data set has over a hundred thousand observations, we had enough data that removing the observations with NaNs would not materially affect our training and test set results (i.e., accuracy).
Further modeling and evaluation will be required to better understand which measures are most appropriate for analyzing the results of our modeling.
We will measure effectiveness by how well a clustering algorithm classifies rainfall throughout Australia.
We will be using the Adjusted Rand Score (ARS) and the Calinski-Harabasz score for validation.
The Rand score (RS) is a measure of similarity between the predicted and true clusterings. The ARS is the RS "adjusted for chance" using this equation:
ARS = (RS - Expected_RS) / (max(RS) - Expected_RS)
An ARS of 0.0 indicates a random labeling of the points, and a score of 1.0 indicates a perfect labeling.
The Calinski-Harabasz score, also called the Variance Ratio Criterion, is typically used when the true labels are not known. It is the ratio of the between-cluster dispersion to the within-cluster dispersion; the higher the score, the better-defined the clusters.
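As a hand-rolled check of the ARS definition above (sklearn.metrics.adjusted_rand_score computes the same quantity; the toy label lists here are ours, not from the dataset):

```python
from math import comb
from collections import Counter

def adjusted_rand_score(labels_true, labels_pred):
    # Pair-counting form of ARS = (RS - Expected_RS) / (max(RS) - Expected_RS).
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))  # contingency table cells
    sum_comb = sum(comb(c, 2) for c in pair_counts.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_comb - expected) / (max_index - expected)

# A perfect labeling (up to a permutation of cluster names) scores 1.0 ...
print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# ... while this partial agreement happens to sit exactly at chance level.
print(adjusted_rand_score([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.0
```

Passing the same label lists to sklearn.metrics.adjusted_rand_score returns identical values, since both implement the same pair-counting formula.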
Visualize any important attributes appropriately. Important: provide an interpretation for any charts or graphs.
import seaborn
import pandas
#Load the pre-processed rainfall data set; response and explanatory variables are assigned later.
rainfall = pandas.read_csv('rainfall.csv', index_col=0)
temp_rainfall = rainfall.copy()
# Visualising missing data:
seaborn.heatmap(temp_rainfall.isnull(),yticklabels=False,cbar=False,cmap='Reds_r')
#Variables 'Evaporation' and 'Sunshine' contained many missing values, too many to be imputed.
temp_rainfall = temp_rainfall.drop(['Evaporation', 'Sunshine'], axis = 1)
#Get name of all cities in the data frame.
l = list(temp_rainfall.Location.unique())
#Drop all observations with NaN's. These are values that could not be imputed using the above code.
temp_rainfall.dropna(subset = list(temp_rainfall), inplace = True)
#List all cities that were dropped
for i in l:
    if i not in temp_rainfall.Location.unique():
        print(i)
#'Date' and 'Location' variables not needed for prediction.
temp_rainfall = temp_rainfall.drop(['Date', 'Location'], axis = 1)
temp_rainfall.info()
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.style.use('ggplot')
plt.figure(figsize=(6,4))
plt.hist(rainfall['RainTomorrow'],bins=2,rwidth=0.8)
plt.xticks([0.25,0.75],['No Rain','Rain'])
plt.title('Frequency of No Rain and Rainy days\n')
print(rainfall['RainTomorrow'].value_counts())
As the figure above shows, the data are skewed between the days it rained and the days it did not: there were over three times as many days without rain as days with rain. This is important to note because it changes how we evaluate the data when looking for key variables to use for clustering.
import seaborn as sns
sns.pairplot(rainfall.dropna()
, vars=['MinTemp','MaxTemp', 'Rainfall' ,'WindGustDir','WindGustSpeed','WindDir9am','WindDir3pm','WindSpeed9am','WindSpeed3pm'
,'Humidity9am','Humidity3pm','Pressure9am','Pressure3pm','Cloud9am','Cloud3pm','Temp9am','Temp3pm' ,'RainToday']
, hue='RainTomorrow')
Additionally, this matrix let us see which variables are closely related to one another. Based on it and Lab 2, we built a reduced dataset using RainToday, MinTemp, Humidity9am, and RainTomorrow.
#From Lab 2 we determined the three features that contributed most to rainfall prediction. We will use
#these three features to cluster rain vs. no rain.
selected_features = ['RainToday', 'MinTemp', 'Humidity9am', 'RainTomorrow']
feature_select=temp_rainfall[selected_features].copy()
feature_select.info()
import seaborn as sns
sns.pairplot(feature_select.dropna()
, vars=['RainToday', 'MinTemp', 'Humidity9am']
, hue='RainTomorrow')
Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertations.
For the clustering task, we chose the F1 score as our evaluation metric. The F1 measure is the harmonic mean of precision and recall. Our dataset is not as sensitive to false positives as, say, a medical study, so neither precision nor recall alone is the right focus. Accuracy is a poor choice when the class sizes are far from equal, and so is not a good fit for this dataset; the class sizes were shown earlier. The F1 score, 2(Precision × Recall)/(Precision + Recall), will be our evaluation metric of choice.
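To make the imbalance point concrete, consider a baseline classifier that always predicts "No Rain" on a set with an illustrative 3:1 class ratio (the counts below are made up for the example, not the exact dataset counts): it scores high accuracy while its F1 for the rain class is zero.

```python
def f1(precision, recall):
    # F1 = 2 * (precision * recall) / (precision + recall)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Confusion counts for a classifier that always predicts "No Rain"
# on 100 days with a 78/22 dry/rainy split (illustrative numbers).
tp, fp, fn, tn = 0, 0, 22, 78
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(accuracy)               # 0.78
print(f1(precision, recall))  # 0.0
```

The useless baseline looks respectable on accuracy alone, which is exactly why F1 is the better fit for this imbalanced dataset.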
This doesn't quite fit with Lab 3, but since we used information and data from lab 2 to help us in lab 3, we have included that information here.
Train and adjust parameters.
#Response variable is 'RainTomorrow'
y = temp_rainfall['RainTomorrow'].values
y_fs = feature_select['RainTomorrow'].values
#Remove response variable from dataframe
del temp_rainfall['RainTomorrow']
del feature_select['RainTomorrow']
#Everything else is the explanatory variables used in prediction.
x = temp_rainfall.values
x_fs = feature_select.values
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import sklearn.metrics
def bench_k_means(estimator, name, data):
    #Fit the estimator and report both validation scores against the true labels y.
    y_hat = estimator.fit_predict(data)
    print(name)
    print("Adjusted Rand Score (-1 to 1): %f" % sklearn.metrics.adjusted_rand_score(y, y_hat))
    print("Calinski-Harabaz Score (higher means better-defined clusters): %f"
          % sklearn.metrics.calinski_harabaz_score(X=data, labels=y_hat))
    return y_hat
scl_obj = StandardScaler()
scl_obj.fit(x)
x_scaled = scl_obj.transform(x)
scl_obj.fit(x_fs)
x_fs_scaled = scl_obj.transform(x_fs)
bench_k_means(KMeans(init='random', n_clusters=2), name="random", data=x)
bench_k_means(KMeans(init='k-means++', n_clusters=2), name="k-means++", data=x)
bench_k_means(KMeans(init='random', n_clusters=2), name="random, scaled", data=x_scaled)
bench_k_means(KMeans(init='k-means++', n_clusters=2), name="k-means++, scaled", data=x_scaled)
bench_k_means(KMeans(init='random', n_clusters=2), name="random, feature-selected", data=x_fs)
bench_k_means(KMeans(init='k-means++', n_clusters=2), name="k-means++, feature-selected", data=x_fs)
bench_k_means(KMeans(init='random', n_clusters=2), name="random, feature-selected, scaled", data=x_fs_scaled)
bench_k_means(KMeans(init='k-means++', n_clusters=2), name="k-means++, feature-selected, scaled", data=x_fs_scaled)
from sklearn.neighbors import kneighbors_graph
import matplotlib.pyplot as plt
import numpy
n=500 #x=1, 10, 50, 100, 500
X2_knn_graph = kneighbors_graph(x_fs, n, mode='distance') # calculate distance to x nearest neighbors
N2 = X2_knn_graph.shape[0]
X2_4nn_distances = numpy.zeros((N2,1))
for i in range(N2):
    X2_4nn_distances[i] = X2_knn_graph[i, :].max()
X2_4nn_distances = numpy.sort(X2_4nn_distances, axis=0)
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
plt.plot(range(N2), X2_4nn_distances, 'r.', markersize=2) #plot the data
plt.title('Dataset name: rainfall, sorted by neighbor distance')
plt.xlabel('rainfall, Instance Number')
plt.ylabel('rainfall, Distance to {0}th nearest neighbor'.format(n))
plt.grid()
Based on the plot above, we chose an epsilon value of 3.5, and will now run DBSCAN with it.
from sklearn.cluster import DBSCAN
labels=bench_k_means(DBSCAN(eps=3.5, min_samples=500), name="DBSCAN", data=x_fs)
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
Evaluate and Compare
Visualize Results
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
# #############################################################################
#https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
X = StandardScaler().fit_transform(x_fs)
# #############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    # Core samples: large markers.
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    # Non-core samples: small markers.
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
With two clusters and only 15 noise points, essentially all data points fall into the two groups at this scale.
data = x_fs
cls = DBSCAN(eps=3.5, min_samples=500)
cls.fit(data)
dbs_labels = cls.labels_
cls = KMeans(n_clusters=2, random_state=1)
cls.fit(data)
kmn_labels = cls.labels_
fig = plt.figure(figsize=(12,8))
title = ['DBSCAN','KMEANS']
for i, l in enumerate([dbs_labels, kmn_labels]):
    plt.subplot(3, 2, 2*i+1)
    plt.scatter(data[:, 0], data[:, 1]+numpy.random.random(data[:, 1].shape)/2, c=l, cmap=plt.cm.rainbow, s=20, linewidths=0)
    plt.xlabel('RainToday'), plt.ylabel('MinTemp')
    plt.grid()
    plt.title(title[i])
    plt.subplot(3, 2, 2*i+2)
    plt.scatter(data[:, 0], data[:, 2]+numpy.random.random(data[:, 1].shape)/2, c=l, cmap=plt.cm.rainbow, s=20, linewidths=0)
    plt.xlabel('RainToday'), plt.ylabel('Humidity9am')
    plt.grid()
    plt.title(title[i])
plt.tight_layout()
plt.show()
Summarize the Ramifications
Based on our results, more analysis with different variables of the dataset is required. We found that there were never more than two clusters, no matter how we attempted to cluster the data. Additionally, several of the clustering methods in sklearn were unsuccessful due to the nature and size of our dataset.
Hindsight might have led us to a different dataset; however, after investing so much time in this data for the Mini-lab and Lab 2, we decided to stick with this set and report our findings as is.
The most interesting finding was that the clustering was not as efficient as we expected. We also found it interesting that there were never more than two clusters. Because two of the variables in our final dataset were binary, this could have been the cause; even so, we expected more than two clusters from at least one combination of parameters.
Be critical of your performance and tell the reader how your current model might be usable by other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?
You have free rein to provide additional analyses or to combine analyses.
In the interest of trying 'just one more thing' to see if we can get a better output, we will use the exceptional work section to test 2 or 3 other combinations of variables with our final dataset.
These sets of variables were not chosen based on regression, dimensionality reduction, or any other statistical measure, so we cannot report on them in our paper.
#http://localhost:8888/notebooks/Documents/GitHub/DataMiningNotebooks/09.%20Clustering%20and%20Discretization.ipynb
#checking the original file for completeness and renaming the file in order to 'experiment' with it.
temp_rainfall.info()
exceptional_rainfall = temp_rainfall.drop(['WindGustDir', 'MinTemp', 'MaxTemp', 'Pressure9am', 'Temp9am'], axis = 1)
exceptional_rainfall.info()
Due to the visual correlations among MinTemp, MaxTemp, Temp9am, and Temp3pm, only Temp3pm remains in exceptional_rainfall. Pressure9am was likewise removed due to its correlation with the matching 3pm reading, and WindGustDir was dropped because it was not significant.
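As a sanity check on those visual correlations, pandas' corr() gives the numeric picture. The snippet below is a self-contained sketch on synthetic temperature columns standing in for the real ones; in the notebook itself you would call temp_rainfall[['MinTemp', 'MaxTemp', 'Temp9am', 'Temp3pm']].corr() instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(12, 5, 200)  # synthetic "daily minimum temperature" series
# Derived readings that track the base series closely, mimicking the
# strong MinTemp/Temp9am/Temp3pm correlations seen in the pairplot.
toy = pd.DataFrame({
    'MinTemp': base,
    'Temp9am': base + rng.normal(3, 1, 200),
    'Temp3pm': base + rng.normal(8, 2, 200),
})
print(toy.corr().round(2))  # off-diagonal correlations are all close to 1
```

Columns whose pairwise correlation is near 1 carry largely redundant information, which is the justification for keeping only one of them.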
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
y = exceptional_rainfall['RainToday']
X = exceptional_rainfall[['Rainfall','WindGustSpeed','WindDir9am','WindDir3pm','WindSpeed9am','WindSpeed3pm',
'Humidity9am','Humidity3pm','Pressure3pm','Cloud9am','Cloud3pm','Temp3pm']]
cv = StratifiedKFold(n_splits=10)
clf = RandomForestClassifier(n_estimators=150,random_state=1)
acc = cross_val_score(clf,X,y=y,cv=cv)
print ("Average accuracy = ", acc.mean()*100, "+-", acc.std()*100)
%%time
import pandas as pd
params = []
for n_rain in range(4, 10):
    # get the first clustering
    cls_rain = KMeans(n_clusters=n_rain, init='k-means++', random_state=1)
    cls_rain.fit(X)
    newfeature_rain = cls_rain.labels_  # the labels from k-means clustering
    X = np.column_stack((X, pd.get_dummies(newfeature_rain)))
    acc = cross_val_score(clf, X, y=y, cv=cv)
    params.append((n_rain, acc.mean()*100, acc.std()*100))  # save state
    print("Clusters", n_rain, "Average accuracy = ", acc.mean()*100, "+-", acc.std()*100)
n_rain=5
X = exceptional_rainfall[['Rainfall','WindGustSpeed','WindDir9am','WindDir3pm','WindSpeed9am','WindSpeed3pm',
'Humidity9am','Humidity3pm','Pressure3pm','Cloud9am','Cloud3pm','Temp3pm']]
cls_rain = KMeans(n_clusters=n_rain, init='k-means++',random_state=1)
cls_rain.fit(X)
newfeature_rain = cls_rain.labels_ # the labels from k-means clustering
plt.figure()
X=X.values
plt.scatter(X[:, 0], X[:, 1]+np.random.random(X[:, 1].shape)/2, c=newfeature_rain, cmap=plt.cm.rainbow, s=20, linewidths=0)
plt.grid()
data = X
cls = DBSCAN(eps=0.125, min_samples=6)
cls.fit(data)
dbs_labels = cls.labels_
cls = KMeans(n_clusters=17, random_state=1)
cls.fit(data)
kmn_labels = cls.labels_
fig = plt.figure(figsize=(12,8))
title = ['DBSCAN','KMEANS']
for i, l in enumerate([dbs_labels, kmn_labels]):
    # One subplot per projection so the four panels don't overdraw each other.
    plt.subplot(2, 2, 2*i+1)
    plt.scatter(data[:, 0], data[:, 1]+np.random.random(data[:, 1].shape)/2, c=l, cmap=plt.cm.rainbow, s=20, linewidths=0)
    plt.grid()
    plt.title(title[i])
    plt.subplot(2, 2, 2*i+2)
    plt.scatter(data[:, 0], data[:, 2]+np.random.random(data[:, 1].shape)/2, c=l, cmap=plt.cm.rainbow, s=20, linewidths=0)
    plt.grid()
    plt.title(title[i])
plt.tight_layout()
plt.show()